Abstract

Count-Min Sketch [?] is a widely adopted algorithm for approximate event
counting in large-scale processing. However, the original version of the
Count-Min Sketch (CMS) suffers from some deficiencies, especially if one is
interested in low-frequency items, as in text-mining tasks. Several variants
of CMS have been proposed to compensate for the high relative error on
low-frequency events, but the proposed solutions tend to correct the errors
instead of preventing them. In this paper, we propose the Count-Min-Log
sketch, which uses logarithm-based, approximate counters [7] instead of
linear counters to improve the average relative error of CMS (with
conservative update) at constant memory footprint.



1 Introduction 

Count-Min Sketch [?] (CMS) is a widely adopted algorithm for approximate event
counting in large-scale processing. With proven bounds in terms of mean absolute
error and confidence, one can easily design a constant-size sketch as an
alternative to expensive exact counting in a setting where the total number of
event types is approximately known.

CMS is used in many applications, often with a focus on high-frequency events
[?]. However, in the domain of text mining, the highest-frequency events are
often of low interest: frequent words are often grammatical, highly polysemous,
or without any interesting semantics, while low-frequency words are more
relevant.

As a matter of fact, common regularizations in text mining consist in computing
the TF-IDF [8] (equation 1) for term/document relevance, or the Pointwise Mutual
Information [?] (equation 2) or Log-likelihood Ratio [3] between two words in
order to estimate the importance of their cooccurrence.



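$$\mathrm{idf}_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|} \tag{1a}$$

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i \tag{1b}$$

$$\mathrm{pmi}(i,j) = \log \frac{p(i,j)}{p(i)\,p(j)} \tag{2a}$$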


In TF-IDF, $|D|$ is the number of documents in the corpus, $\mathrm{tf}_{i,j}$
is the frequency of term $t_i$ in document $d_j$ and $|\{d_j : t_i \in d_j\}|$
is the number of documents containing the term $t_i$. In PMI, $p(i,j)$ is the
probability that words $i$ and $j$ appear in a cooccurrence window and $p(i)$
the probability of finding $i$ in the corpus.

These formulas show that higher-frequency words yield relatively lower final
values. Moreover, they all use a logarithm, which shows that only the order of
magnitude is important.
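As a small illustration of the TF-IDF and PMI formulas, the following Python
sketch evaluates them on made-up numbers. The function names and the example
values are assumptions chosen for illustration; they do not come from the
experiments reported here.

```python
import math


def idf(num_docs, docs_with_term):
    """Inverse document frequency: log(|D| / |{d_j : t_i in d_j}|)."""
    return math.log(num_docs / docs_with_term)


def tfidf(tf, num_docs, docs_with_term):
    """TF-IDF of a term in one document."""
    return tf * idf(num_docs, docs_with_term)


def pmi(p_ij, p_i, p_j):
    """Pointwise Mutual Information: log(p(i,j) / (p(i) p(j)))."""
    return math.log(p_ij / (p_i * p_j))


# Hypothetical corpus of 1000 documents.
common = idf(1000, 900)   # term present in 900 documents
rare = idf(1000, 10)      # term present in 10 documents
print(common, rare)       # the rare term receives the larger weight

# Same joint probability, but the second pair is made of rarer words.
print(pmi(1e-4, 1e-2, 1e-2), pmi(1e-4, 1e-3, 1e-3))
```

In both cases the rarer events get the larger score, and a tenfold change in a
frequency only shifts the result by an additive constant, which is why a count
that is correct in order of magnitude is often good enough.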

The original version of the Count-Min Sketch (CMS) suffers from some
deficiencies, especially when one is interested not in the high-frequency
events but in the low-frequency ones, as in the tasks above. A variant of CMS
[5] has been proposed to compensate for the high relative error on
low-frequency events, but the solutions explored tend to correct the errors
instead of preventing them.

2 Count-Min-Log Sketch with Conservative Update 

We propose a variant of the Count-Min Sketch with conservative update that uses
logarithm-based, approximate counters [7] instead of linear counters to improve
the average relative error of CMS at constant memory footprint. The principle is
to use exactly the same structure as the Count-Min Sketch with conservative
update, replacing only the classical binary counting cells with log counting
cells. With this modification, the Update and Query procedures become
Algorithm 1 and Algorithm 2, respectively.

The rationale behind this variant is as follows:

1. The original CMS uses as many bits (usually 32 or 64) to represent low
values as high values. However, in skewed distributions like those of natural
languages, low values are much more frequent than high values, and they need
fewer bits. As a consequence, one can consider using smaller counters, and
thus increasing the number of counters for the same storage space.

2. Furthermore, and this is especially the case for highly skewed distributions
like those of Zipfian data, Count-Min Sketch estimates high-frequency events
very well (Count-Min Sketch with Conservative Update is even better in this
regard), but low-frequency events very poorly. As a consequence, estimates of
Pointwise Mutual Information, for instance, are largely off for low-frequency
items, which may cause several problems for large-scale NLP tasks.

Algorithm 1 Count-Min-Log Sketch UPDATE

Input: sketch width w, sketch depth d, log base b > 1, independent hash
functions h_1..d : U → {1 ... w}

    function IncreaseDecision(c)
        return True with probability b^(-c), else False
    end function

    function Update(e)
        c ← min_{1 ≤ k ≤ d} sk[k, h_k(e)]
        if IncreaseDecision(c) then
            for k ← 1 ... d do
                if sk[k, h_k(e)] = c then
                    sk[k, h_k(e)] ← c + 1
                end if
            end for
        end if
    end function

Algorithm 2 Count-Min-Log Sketch QUERY

Input: sketch width w, sketch depth d, log base b > 1, independent hash
functions h_1..d : U → {1 ... w}

    function PointValue(c)
        if c = 0 then
            return 0
        else
            return b^(c-1)
        end if
    end function

    function Value(c)
        if c ≤ 1 then
            return PointValue(c)
        else
            v ← PointValue(c + 1)
            return (1 - v) / (1 - b)
        end if
    end function

    function Query(e)
        c ← min_{1 ≤ k ≤ d} sk[k, h_k(e)]
        return Value(c)
    end function
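The Update and Query procedures of Algorithms 1 and 2 can be sketched in Python
as follows. This is a minimal illustration under stated assumptions, not the
authors' implementation: the class name, the row-salted BLAKE2b hashing that
stands in for the independent hash functions, the saturation guard on the cell
width, and the restriction to string items are all choices made here. The
query assumes that Value returns (1 - b^c) / (1 - b) for a cell holding c > 1.

```python
import hashlib
import random


class CountMinLogCU:
    """Count-Min-Log sketch with conservative update (illustrative only)."""

    def __init__(self, width, depth, base, cell_bits=8, seed=0):
        self.w, self.d, self.b = width, depth, base
        # Assumption: cells saturate at their maximum value instead of wrapping.
        self.max_cell = (1 << cell_bits) - 1
        self.sk = [[0] * width for _ in range(depth)]
        self.rng = random.Random(seed)

    def _index(self, k, e):
        # Assumption: one salted BLAKE2b per row plays the role of h_k.
        h = hashlib.blake2b(e.encode("utf-8"), digest_size=8,
                            salt=k.to_bytes(8, "little"))
        return int.from_bytes(h.digest(), "little") % self.w

    def update(self, e):
        cells = [(k, self._index(k, e)) for k in range(self.d)]
        c = min(self.sk[k][j] for k, j in cells)
        # IncreaseDecision: increment with probability b^(-c).
        if c < self.max_cell and self.rng.random() < self.b ** (-c):
            for k, j in cells:
                if self.sk[k][j] == c:   # conservative update: only minimal cells
                    self.sk[k][j] = c + 1

    def query(self, e):
        c = min(self.sk[k][self._index(k, e)] for k in range(self.d))
        if c <= 1:
            return float(c)              # PointValue(0) = 0, PointValue(1) = 1
        return (1 - self.b ** c) / (1 - self.b)


sketch = CountMinLogCU(width=1024, depth=2, base=1.00025, cell_bits=16, seed=1)
sketch.update("rare")
for _ in range(1000):
    sketch.update("the")
print(sketch.query("rare"), sketch.query("the"))
```

With a base this close to 1 the increment probability stays near 1 for small
counts, so low counts are stored almost exactly and the randomness only becomes
noticeable for larger values.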

These observations led to a first conclusion, shared by [5], which is that for
NLP tasks, the error metric that should be used on point query estimates is the
Average Relative Error (ARE), not the Root Mean Square Error (RMSE). This led
in turn to the following hypothesis: replacing the linear counters with
logarithmic counters should yield, under some conditions, an improvement in
the ARE.
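To make the difference between the two metrics concrete, here is a small Python
sketch using their usual definitions (mean of |estimate - true| / true for ARE,
square root of the mean squared error for RMSE). The counts and the two
hypothetical sketches A and B are invented for illustration.

```python
import math


def average_relative_error(true_counts, estimates):
    """Mean of |estimate - true| / true over all items."""
    pairs = list(zip(true_counts, estimates))
    return sum(abs(e - t) / t for t, e in pairs) / len(pairs)


def rmse(true_counts, estimates):
    """Root mean square of the absolute errors."""
    pairs = list(zip(true_counts, estimates))
    return math.sqrt(sum((e - t) ** 2 for t, e in pairs) / len(pairs))


# Hypothetical counts: three rare items and one frequent item.
true_counts = [1, 2, 3, 10000]

# Sketch A overestimates every item by 5, which hurts the rare items most.
est_a = [t + 5 for t in true_counts]
# Sketch B is exact on the rare items but off by 20 on the frequent one.
est_b = [1, 2, 3, 10020]

print(average_relative_error(true_counts, est_a), rmse(true_counts, est_a))
print(average_relative_error(true_counts, est_b), rmse(true_counts, est_b))
```

On this toy data RMSE ranks sketch A first (5 versus 10), while ARE strongly
prefers sketch B, because ARE weights an error by the size of the count it
affects.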

Another way to evaluate the precision of an approximate counting sketch for
NLP tasks is to directly measure the error on the Pointwise Mutual
Information.


3 Empirical evaluation 

3.1 Data 

We have verified this hypothesis empirically in the following setting: we count
unigrams and bigrams over 500k words of the 20newsgroups corpus [?]. This small
corpus contains 233k counted elements: 183k bigrams and 50k unigrams.

In the following, the ideal perfect count storage size corresponds, for a given
number of elements, to the minimal amount of memory needed to store them
perfectly in an ideal setting. A high-pressure setting is one where the memory
footprint is lower than the ideal perfect count storage size for the same
number of elements.

3.2 Variants 

We compare the estimates of three sketches: 

CMS-CU is the classical linear Count-Min Sketch with Conservative Update,

CMLS16-CU is the Count-Min-Log Sketch with Conservative Update using a
logarithmic base of 1.00025 and 16-bit counters, and

CMLS8-CU is the Count-Min-Log Sketch with Conservative Update using a
logarithmic base of 1.08 and 8-bit counters.
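The trade-off between counter width and logarithm base can be made concrete
with a little arithmetic. The Python sketch below assumes that a cell holding
c > 1 is read back as (b^c - 1) / (b - 1), as in the Value function of
Algorithm 2, and uses an arbitrary budget of 32 KiB; the helper names and the
choice of 32-bit linear counters for CMS-CU are assumptions made for this
example.

```python
def max_representable_count(base, bits):
    """Estimate returned for a saturated cell, i.e. c = 2^bits - 1."""
    c_max = 2 ** bits - 1
    return (base ** c_max - 1) / (base - 1)


def cells_in_budget(budget_bytes, bits):
    """Number of counters of the given width that fit in the budget."""
    return (budget_bytes * 8) // bits


budget = 32 * 1024  # bytes, arbitrary
for name, base, bits in [("CMLS16-CU", 1.00025, 16), ("CMLS8-CU", 1.08, 8)]:
    print(name, f"{max_representable_count(base, bits):.3g}",
          cells_in_budget(budget, bits))

# 32-bit linear counters, assumed here for CMS-CU.
print("CMS-CU", 2 ** 32 - 1, cells_in_budget(budget, 32))
```

Under these assumptions both logarithmic configurations can still represent
counts of several billion, while offering two to four times as many cells as
32-bit linear counters in the same space.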

3.3 Error on counts 

The results for the Average Relative Error on simple counts are shown in
Figure 1. The vertical line indicates the storage needed to store all the
counts perfectly (the extra memory required for accessing the counters is not
taken into account).
[Plot not reproduced. Legend: Sketch Name (CMLS16-CU, CMLS8-CU, CMS-CU); Sketch Height.]


Figure 1: Average Relative Error of estimated counts with Count-Min Sketch
(CMS-CU, blue lines), Count-Min-Log 16bits (CMLS16-CU, red lines) and
Count-Min-Log 8bits (CMLS8-CU, green lines). The bold vertical line
corresponds to the ideal perfect counts storage size.


These experiments show that, before the perfect storage mark, the estimation
error of CMLS16-CU is approximately 2 to 4 times lower than that of CMS-CU. The
improvement of CMLS8-CU over CMS-CU is in the range of 7 to 12 times; however,
CMLS8-CU reaches a minimal ARE of $10^{-1.5}$ and stops improving, due to the
residual error caused by approximate counting.
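The residual error of approximate counting can be observed in isolation by
simulating a single logarithmic counter, with no hash collisions at all. The
Python sketch below is only an illustration: the function names, the number of
events, the number of trials and the seed are arbitrary, and it does not
attempt to reproduce the figures reported above.

```python
import random


def approximate_count(n_events, base, rng):
    """Feed n_events to a single log counter and return its estimate."""
    c = 0
    for _ in range(n_events):
        if rng.random() < base ** (-c):   # increment with probability b^(-c)
            c += 1
    return (base ** c - 1) / (base - 1)   # assumed read-back of a cell holding c


def mean_relative_error(n_events, base, trials=200, seed=0):
    """Average |estimate - n| / n over independent runs of one counter."""
    rng = random.Random(seed)
    errors = [abs(approximate_count(n_events, base, rng) - n_events) / n_events
              for _ in range(trials)]
    return sum(errors) / trials


err_8 = mean_relative_error(1000, 1.08)      # base used by CMLS8-CU
err_16 = mean_relative_error(1000, 1.00025)  # base used by CMLS16-CU
print(err_8, err_16)
```

Because this error comes from the counter itself, adding more cells to the
sketch cannot reduce it; only a base closer to 1, and hence wider counters for
the same range, does.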

3.4 Error on PMI 

In a second step, we compute the Pointwise Mutual Information of the bigrams,
and the error between the PMI estimated from sketch counts and the PMI computed
from exact counts. The results for RMSE on estimated PMI are illustrated in
Figure 2.
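One possible way to compute this error is sketched below in Python. The
normalization is an assumption made for this example (probabilities are taken
as relative frequencies, with separate totals for unigrams and bigrams), as are
the function names and the toy counts; the text does not specify these details.

```python
import math


def pmi_from_counts(c_ij, c_i, c_j, n_unigrams, n_bigrams):
    """PMI with probabilities estimated as relative frequencies."""
    return math.log((c_ij / n_bigrams) / ((c_i / n_unigrams) * (c_j / n_unigrams)))


def pmi_rmse(bigrams, exact, estimated, n_unigrams, n_bigrams):
    """RMSE between PMI from exact counts and PMI from estimated counts.

    `exact` and `estimated` map a unigram (str) or a bigram (tuple) to a count.
    """
    squared = []
    for w1, w2 in bigrams:
        ref = pmi_from_counts(exact[(w1, w2)], exact[w1], exact[w2],
                              n_unigrams, n_bigrams)
        est = pmi_from_counts(estimated[(w1, w2)], estimated[w1], estimated[w2],
                              n_unigrams, n_bigrams)
        squared.append((est - ref) ** 2)
    return math.sqrt(sum(squared) / len(squared))


# Toy data (hypothetical): the sketch inflates a rare bigram count from 2 to 4.
exact = {"new": 50, "york": 20, ("new", "york"): 2}
estimated = {"new": 50, "york": 20, ("new", "york"): 4}
error = pmi_rmse([("new", "york")], exact, estimated, n_unigrams=1000, n_bigrams=999)
print(error)
```

Doubling the estimate of a rare bigram shifts its PMI by log 2 regardless of
the other counts, which illustrates why relative errors on low counts dominate
the PMI error.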

These results show that, with sketches near or smaller than the theoretical size 
of a perfect storage, CMLS16-CU (respectively CMLS8-CU) outperforms CMS- 
CU by a factor of about 4 (resp. 10) on the RMSE of the PMI. 

The histograms of PMI values for each sketch are illustrated in Figure 3,
showing that, for equivalent storage, CMS-CU presents a very distorted
histogram on the right part (the interesting part for NLP tasks), while
CMLS8-CU is much closer to the reference.
[Plot not reproduced. Legend: Sketch Name (CMLS16-CU, CMLS8-CU, CMS-CU); Sketch Height.]


Figure 2: Root Mean Square Error of estimated PMI with Count-Min Sketch
(CMS-CU, blue line), Count-Min-Log 16bits (CMLS16-CU, red line) and
Count-Min-Log 8bits (CMLS8-CU, green line). The bold vertical line corresponds
to the ideal perfect counts storage size.



Figure 3: Histograms of PMI values estimated for Count-Min Sketch (CMS-CU,
grey area) and Count-Min-Log 8bits (CMLS8-CU, green area) sketches with 32kb
storage, 2 levels. The area in red is the reference PMI computed with exact counts.
4 Conclusion and perspectives


We have proposed a simple variant of the classical Count-Min Sketch with
Conservative Update which presents a significant improvement in the Average
Relative Error on counts and in the RMSE on the Pointwise Mutual Information.
While the gain is always clear for high-pressure setups, where the available
storage is less than or equal to the ideal storage size, the residual error
due to approximate counting is an absolute lower bound on the error, which can
be hit at different sketch sizes depending on the logarithm base.

The next steps of this work are the following: 

• Compare results of PMI estimations only for interesting values (i.e. over a
given threshold), since we have observed in Figure 3 that CMS-CU seems to
be particularly far from the reference on the right side of the histogram.

• Evaluate the speed difference of our variant compared to CMS-CU. 

Additionally, we are investigating two other directions: 

1. Hierarchical storage cells, with more cells to store the least significant
bits and fewer cells for the most significant bits.

2. Probabilistic update rule: we have observed that the ratio between smallest 
and second smallest estimates is correlated with the error. We want to try a 
probabilistic update rule that will take this ratio into account. 